Loan Analysis by Nirupama Puthur Venkataraman

Project Overview: An investigation into loan data to guage relations, if any between parameters of interest!

Note: As per the requirements of this project, code given below follows is limited to 80 characters or less.

All values are in dollars in terms of revenue.

Loading all of the packages needed for analysis in this code chunk.

Load the Data

Observation: Before creating any plots, we will explore the dataset.

Using commands, we will inspect structure, derive counts for rows and columns, column names, etc.

Observation:

Before creating any plots, we will explore the dataset. Using commands, will retrive counts fo rows and columns, column names, etc. We are using names(), str(), dim(), head() to inspect the dataset

## [1] 113937     81

Univariate Plots Section

Notes:

In this section, we perform some preliminary exploration of the loan dataset, run some summaries of the data and create univariate plots to understand the structure of the individual variables in the dataset. Run summary statistics on the data

Plot 1:

First, we shall analyze the different income ranges of lon recepients. While income does not correlate to loan repayment success, it is a useful factor to segment customer base. We illustrate a simple bar plot with counts of number of records in each income bracket. We can see that 25K-74.999K are the biggest loan audiences.

Plot 2:

Now, we shall analyze the estimated loss amount in each loan transaction. We can plot a histogram using the concept of bar plots. We can see that 25K-74.999K are the biggest loan audiences. As clear form the plot, this ratio is low.

Plot 3:

Next, we analyze the distribution of records in employment status brackets. This is an important metric Based on the results, any high performing segment can be favored for future loan transactions. It can be seen here that income groups with salries (Employment and Full time) form the major cohort of the loan applicants. However, as categories like Not Employed and Other are present in addition to null values, this is not a very useful stand alone variable.

Plot 4

Another important consideration while approving a loan is debt to income ratio. We study this using a bar plot. Again, as a single variable, while not very insightful, it does demonstrate that most of the applicants have favorable debt to income ratio. This is suggestive of stringent checks in the loan application process, for instance.

Plot 6

Next, we explore the distribution of original loan amount among the loan records. The principal sum, in conjunction with debt to income ratio, can give provide pointers on the success of loans. We study this using a bar plot. Most loans amounts fall in the low to mid range.

Plot 7

Next, we explore the distribution of loan amounts with geographic locations to see if a pattern exists with loan requirements. We study this using a bar plot. It is interesting to note that some states have very high number of borrowers while others barely have any. Of course, this observation to the limitations of data sampling. Potential causes include uneven sampling of data, lack of lenders and businesses.

Plot 8

Next, we explore the distribution of Credit Grades among the loan applicants. We study this using a bar plot. Since majority of the loans have not been assigned a grade, we are pointed to a good line of analytical scrutiny. It can be that the loan process/ stage is such that grades are not yet available. Or that the applicant has limited credit history, or that the loan applicant is a joint venture. In combination with loan status, this might yield better insights.

Plot 9

Next, we explore the distribution of loan term among the loan applicants. We study this using a histogram. We can conclude that 36 month loans are most popular. So, our next iteration of analysis can be to see if this is a result of rates, offerings, or if other fcators contribute.

Plot 10

Next, we explore the distribution of Monthly loan payments among the loan applicants. We study this using a bar plots.

Plot 11

Next, we explore the distribution of Employment status among the loan applicants. We study this using a Cleveland dot chart

Plot 12

Next, we explore the distribution of loan status among the loan applicants. We study this using a simple bar chart Based on our plot, we see that maximum loans are current or completed. This can bode well for the institution depending on what the financial gains are from current loans. We also can note that chargedoff, defaulted and past due loans are fewer in number which is a positive factor.

Plot 13

Further, we explore the distribution of lender yield status from the loan transactions to gauge what the feasability is. We study this using a simple bar chart We understand that lender yield, while not super high is fairly high around the mid levels.

Plot 14

Further, we explore the investigate what the typical borrower rates are We study this using a simple bar chart and get typical rates for loans

Univariate Analysis

Notes: The bar plots and histograms illustrated above are for categorical and numeric variables of interest from the loan dataset. Notes: Reflection on and summary on completion of univariate explorations. Answers below to project Questions

What is the structure of your dataset?

data.frame with 113937 obs. of 81 variables.
Thus, 113937 rows and 81 columns

What is/are the main feature(s) of interest in your dataset?

LoanOriginalAmount vs EstimatedLoss

What other features in the dataset do you think will help support your

into your feature(s) of interest? > InquiriesLast6Months, DebtToIncomeRatio, IncomeRange, LoanStatus, BorrowerState, CreditGrade, Term

Did you create any new variables from existing variables in the dataset?

No

Of the features you investigated, were there any unusual distributions?

Did you perform any operations on the data to tidy, adjust,or change the form

 of the data? If so, why did you do this? > No

Bivariate Plots Section

Notes: Based on observations in the univariate plots, we explore some relationships between variables that might be interesting to look at in this section.

Plot 15

Here, we explore the relation if any, between Loan Original Amount versus Estimated Loss. We study this using a simple scatter plot As evident here, the plot is roughly negatively sloped. As the loan amount increases, estimated losses reduce. This can be a good point for further investigation. We can check if a correlation exists, if it is causal. The relation hints at many plausible reasons, such as higher background check for higher loan amounts, greater payment ability with higher borrowing ability, more steps or stronger action to reclaim larger loan amounts or simply greater incentive to repay larger loans to avoid crippling compound interests.

Plot 16

Here, we explore the relation if any, between Debt To Income Ratio versus Estimated Loss. We study this using a simple scatter plot This is an interesting plot because the concentration of points is on Y-axis. This again is a line of exploration worth pursuing further.

## $title
## [1] "Debt to Income Ratio versus Estimated Loss"
## 
## $subtitle
## NULL
## 
## attr(,"class")
## [1] "labels"

Plot 17

Here, we explore the relation if any, between Income Range versus Estimated Loss. We study this using a simple scatter plot The losses are roughly equal over the income ranges.

Plot 18

Here, we explore the relation if any, between Income Range versus Estimated Loss. We study this using a simple scatter plot The losses are roughly equal over the income ranges.

Correlations

## 
##  Pearson's product-moment correlation
## 
## data:  loanData$LoanOriginalAmount and loanData$EstimatedLoss
## t = -138.69, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4353596 -0.4243895
## sample estimates:
##        cor 
## -0.4298904
## 
##  Pearson's product-moment correlation
## 
## data:  loanData$DebtToIncomeRatio and loanData$EstimatedLoss
## t = 35.404, df = 77555, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1191835 0.1330353
## sample estimates:
##       cor 
## 0.1261155
## 
##  Pearson's product-moment correlation
## 
## data:  loanData$InquiriesLast6Months and loanData$EstimatedLoss
## t = 84.881, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2735462 0.2859500
## sample estimates:
##       cor 
## 0.2797598
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(loanData$Term) and loanData$EstimatedLoss
## t = -31.39, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1137865 -0.1004841
## sample estimates:
##        cor 
## -0.1071401
## 
##  Pearson's product-moment correlation
## 
## data:  loanData$BorrowerAPR and loanData$EstimatedLoss
## t = 881.84, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9488713 0.9501952
## sample estimates:
##       cor 
## 0.9495375

Bivariate Analysis

Notes:Summary of observations found in bivariate explorations here. Answers to the questions below are below.

Talk about some of the relationships you observed in this part of the

. How did the feature(s) of interest vary with other features in dataset? > LoanOriginalAmount vs EstimatedLoss shows higher clustering for lower values of each. This is interesting and needs to be probed further in conjunction with terms and rates. > EmploymentStatus, DebtToIncomeRatio, IncomeRange, LoanStatus, BorrowerState, CreditGrade, Term were the other variables observed. Borrower APR and Estimated Loss show a direct correlation, which can be attributed to higher rates being difficult to return and hence showing greater loss for defaulters.

Did you observe any interesting relationships between the other features?

One of the factors is Inquiries in Last 6 months which I supposed would have a high correlation with Estimated Loss. However, the correlation is weak at best. The variable Term shows a negative correlation with Estimated Loss, contrary to my initial line of thought.

What was the strongest relationship you found?

Estimated Loss versus Borrower Apr was the strongest relation among those I investigated.

Multivariate Plots Section

Notes: Based on the relationships in the bivariate plots section, creating a few multivariate plots to investigate more complex interactions between variables.

Plot 19

Next, we explore the relation if any, between Estimated Loss and Loan Original Amount but with Income range We study this using a scatter plot and colors to represent the third variable. Since estimated loss and Principal were roughly inversely proportional, we can probe to see if income range had a role to play. Also shown here is the linear model. We can note that losses reduce with increased income range. Similar to our earlier analysis, this can stem from many plausible reasons, such as higher background check for higher loan amounts, greater payment ability with higher borrowing ability, more steps or stronger action to reclaim larger loan amounts or simply greater incentive to repay larger loans to avoid crippling compound interests.

Plot 20

Next, we explore the relation if any, between Estimated Loss and Loan Original Amount but with Monthly Loan Payment We study this using a scatter plot and size to represent the third variable. Since estimated loss and Principal were roughly inversely proportional, we can probe to see if monthly payments had a role to play. We can see that the monthly payments increase with increase in original loan amount, but not contribute to any spikes in estimated loss. While not contributing new insights, it fits with the our inferences so far that higher the loan amount, lesser the estimated loss.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the

. Were there features that strengthened each other in terms of at your feature(s) of interest? > On plotting Estimated Loss and Borrower APR along with Credit Grade and Income Range in the two scatterplots, we can view our two primary variables of interest along with the context of other variables like Credit, Income ranges, etc.

Were there any interesting or surprising interactions between features?

The interesting plot was the relation between Estimated Loss, Loan Original Amount and Monthly Loan Payment. It showed clearly that small to medium loan amounts with small monthly payments showed low losses. This is also supplanted by the fact that these small amount lans with low instalments are typical of medium-high income groups as evident by the preceding scatterplot (Relation between Estimated Loss, Loan Original Amount and Income Range)

OPTIONAL: Did you create any models with your dataset?

Discuss the strengths and limitations of your model.

> No

Final Plots and Summary

Summary: Most interesting findings Section

Simple Bar Plot showing number of records for borrower States

Description One

It is interesting to note that some states have very high number of borrowers while others barely have any. Of course, this observation to the limitations of data sampling. Potential causes include uneven sampling of data, lack of lenders and businesses. Since only a single categorical variable is being diplayed, a bar plot is suitable.

Scatterplot for Borrower APR and Estimated Loss

Description Two

This plot shows that as the APR increases, so do estimated losses. Complementing the simple compounding principle from the world of finance and accounts, the loan dataset clearly confirms the adage. As rates go higher, it is difficult to pay causing a loss for banker and greater instalments for the loan recepient. Since two numerical variables are bieng compared, the scatterplot is an obvious choice. The graph is supported by the correlation (value is 0.9495) done in prior sections.

Description Three

The relation between Estimated Loss, Loan Original Amount and Monthly Loan Payment is depicted using a scatterplot with color representing the third variable. As all three variables are numeric values, a scatterplot is teh perfect choice. It showed clearly that small to medium loan amounts with small monthly payments showed low losses. This is also supplanted by the fact that these small amount lans with low instalments are typical of medium-high income groups as evident by the preceding scatterplot (Relation between Estimated Loss, Loan Original Amount and Income Range)

Reflection

Notes: Due to large numer of variables, it was difficult to choose which ones to work with. My primary variables of interest were LoanOriginalAmount vs EstimatedLoss. While the correlation is not very strong, visually one can view a weak polynomial based, negatively inclined curve. Some of the assumptions I had at the outset were negated or not strongly proven by the dataset. For instance, I had assumed that Inquiries in Last 6 months would have a high correlation with Estimated Loss. However, the correlation is weak at best. Also, the variable Term shows a negative correlation with Estimated Loss, contrary to my initial line of thought. The insights from LoanOriginalAmount vs EstimatedLoss should be applied to Occupation as well and then to Credit Lines, to see if any relation exists. Also, borrower APR and Estimated Loss have a high correlation. It should be further viewed with the trader and credit line variables to see if high risk ventures correlate to loan defaulting.